
121 research outputs found

Testing Interestingness Measures in Practice: A Large-Scale Analysis of Buying Patterns

Author: Amer-Yahia Sihem
Kirchgessner Martin
Leroy Vincent
Mishra Shashwat
Publication date: 15/03/2016

Understanding customer buying patterns is of great interest to the retail industry and has been shown to benefit a wide variety of goals, ranging from managing stocks to implementing loyalty programs. Association rule mining is a common technique for extracting correlations such as "people in the South of France buy rosé wine" or "customers who buy pâté also buy salted butter and sour bread." Unfortunately, sifting through a high number of buying patterns is not useful in practice, because of the predominance of popular products in the top rules. As a result, a number of "interestingness" measures (over 30) have been proposed to rank rules. However, there is no agreement on which measures are more appropriate for retail data. Moreover, since pattern mining algorithms output thousands of association rules for each product, the ability for an analyst to rely on ranking measures to identify the most interesting ones is crucial. In this paper, we develop CAPA (Comparative Analysis of PAtterns), a framework that provides analysts with the ability to compare the outcome of interestingness measures applied to buying patterns in the retail industry. We report on how we used CAPA to compare 34 measures applied to over 1,800 stores of Intermarché, one of the largest food retailers in France.
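The abstract's remark that popular products dominate the top rules can be illustrated with a small sketch. It assumes three textbook measures (support, confidence, lift); this listing does not say which 34 measures CAPA compares, and the receipts and the `rule_measures` helper are hypothetical.

```python
def rule_measures(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    Textbook definitions, assumed here for illustration only.
    """
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)         # receipts with the antecedent
    n_c = sum(1 for t in transactions if c <= t)         # receipts with the consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # receipts with both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical receipts: bread and milk are popular, pate and salted butter are niche.
receipts = [
    {"bread", "milk"},
    {"bread", "milk", "pate"},
    {"bread", "pate", "salted butter"},
    {"milk"},
    {"bread", "milk"},
    {"pate", "salted butter"},
]

popular = rule_measures(receipts, {"bread"}, {"milk"})
niche = rule_measures(receipts, {"pate"}, {"salted butter"})
print("bread -> milk:", popular)
print("pate -> salted butter:", niche)
```

On this toy data, support and confidence rank the popular rule first while lift ranks the niche rule first, which is the kind of disagreement between measures that a comparison framework has to surface.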

arXiv.org e-Print Archive

Crossref

Hal - Université Grenoble Alpes

Fouille et classement d'ensembles fermés dans des données transactionnelles de grande échelle [Mining and ranking closed itemsets from large-scale transactional datasets]

Author: Kirchgessner Martin
Publication venue: HAL CCSD
Publication date: 26/09/2016

The recent increase of data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of item sets, for example supermarket tickets) containing at least a million transactions over hundreds of thousands of items. These datasets usually follow a “long tail” distribution: a few items are very frequent, and most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (usually the most frequent ones). Thus, existing methods fail to concisely provide relevant insights on large datasets. We therefore introduce a new semantics which is more intuitive for the analyst: browsing associations per item, for any item, and fewer than a hundred associations at once. To address the item coverage challenge, our first contribution is the item-centric mining problem. It consists in computing, for each item in the dataset, the k most frequent closed itemsets containing this item. We present an algorithm to solve it, TopPI. We show that TopPI efficiently computes interesting results over our datasets, outperforming simpler solutions or emulations based on existing algorithms, both in terms of run-time and result completeness. We also show and empirically validate how TopPI can be parallelized, on multi-core machines and on Hadoop clusters, in order to speed up computation on large-scale datasets. Our second contribution is CAPA, a framework allowing us to study which existing measures of association rules' quality are relevant to rank results. This concerns results obtained from TopPI or from jLCM, our implementation of a state-of-the-art frequent closed itemsets mining algorithm (LCM). Our quantitative study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study, in order to discover which of the 5 families we propose highlights the most interesting associations for their domain. Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to show extensive experiments on real, nation-wide supermarket data. We present a complete analytics workflow addressing this use case. We also experiment on Web data. Our contributions can be relevant in various other fields, thanks to the genericity of transactional datasets. Altogether, our contributions allow analysts to discover associations of interest in modern datasets. We pave the way for a more reactive discovery of item associations in large-scale datasets, whether on highly dynamic data or for interactive exploration systems.

Thèses en Ligne

Hal - Université Grenoble Alpes

The effect of enteral and parenteral feeding on secretion of orexigenic peptides in infants

Background: Feeding in the first months of life seems to influence the risk of obesity and the susceptibility to some diseases, including atherosclerosis. The mechanisms of these relations are unknown; however, modification of hormonal action can likely be taken into account. Therefore, in this study the levels of ghrelin and orexin A, a peripheral and a central peptide of the orexigenic gut-brain axis, were determined. Methods: Fasting and one-hour postprandial plasma concentrations of ghrelin and orexin were measured in age-matched infants (mean 4 months) who were breast-fed (group I; n = 17), milk formula-fed (group II; n = 16), or fed a highly hydrolyzed, hypoallergenic formula (group III; n = 14), as well as in children with iv provision of nutrients (glucose, group IV; n = 15; total parenteral nutrition, group V; n = 14). Peptides were determined using commercial EIA kits. Results: Despite the similar caloric intake in orally fed children, the fasting ghrelin and orexin levels were significantly lower in the breast-fed children (0.37 ± 0.17 and 1.24 ± 0.29 ng/ml, respectively) than in the remaining groups (0.5 ± 0.27 and 1.64 ± 0.52 ng/ml, respectively, in group II and 0.77 ± 0.27 and 2.04 ± 1.1 ng/ml, respectively, in group III). The postprandial concentrations of ghrelin increased to 0.87 ± 0.29 ng/ml (p < 0.002) and 0.76 ± 0.26 ng/ml (p < 0.01) in groups I and II, respectively, as compared to fasting values. A decrease in the concentration of ghrelin after the meal was observed only in group III (0.47 ± 0.24 ng/ml). Feeding did not influence the orexin concentration. In groups IV and V the ghrelin and orexin levels resembled those in milk formula-fed children. Conclusion: The highly hydrolyzed diet strongly affects fasting and postprandial ghrelin and orexin plasma concentrations, with possible negative short- and long-term effects on development. Total parenteral nutrition, with its continuous stimulation and lack of fasting/postprandial modulation, might also be responsible for disturbed development in children fed this way.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Methane emission by Camelids

Author: A Guerouali
A Vallenas
Adam J. Munn
ALF Hellwing
AR Moss
B St-Pierre
BR Carmean
C Kayouli
C Martin
CE Stevens
Cordula Galeffi
CS Pinares-Patiño
Dario Moser
DP Morgavi
E Schulze
FD Sauer
HF Hintz
J Coventry
J Lassen
J Lerner
J Madsen
J Vernet
J-L Wang
JC Wheeler
JHP Hackstein
JP Dulphy
K Meyer
KA Johnson
KL Blaxter
KR Lassey
M Kirchgessner
M Lechner-Doll
M Sponheimer
Marcus Clauss
Marie T. Dittmann
MB Ghali
Michael Kreuzer
PJ Crutzen
PJ Van Soest
PW Moe
Q Liu
R Franz
R Heller
R Tulgat
Richard A. Lang
Ullrich Runge
W von Engelhardt
WK Saalfeld
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014

Methane emissions from ruminant livestock have been intensively studied in order to reduce their contribution to the greenhouse effect. Ruminants were found to produce more enteric methane than other mammalian herbivores. As camelids share some features of their digestive anatomy and physiology with ruminants, it has been proposed that they produce similar amounts of methane per unit of body mass. This is of special relevance for countrywide greenhouse gas budgets of countries that harbor large populations of camelids, like Australia. However, hardly any quantitative methane emission measurements have been performed in camelids. In order to fill this gap, we carried out respiration chamber measurements with three camelid species (Vicugna pacos, Lama glama, Camelus bactrianus; n = 16 in total), all kept on a diet consisting of food produced from alfalfa only. The camelids produced less methane expressed on the basis of body mass (0.32 ± 0.11 L kg⁻¹ d⁻¹) when compared to literature data on domestic ruminants fed on roughage diets (0.58 ± 0.16 L kg⁻¹ d⁻¹). However, there was no significant difference between the two suborders when methane emission was expressed on the basis of digestible neutral detergent fiber intake (92.7 ± 33.9 L kg⁻¹ in camelids vs. 86.2 ± 12.1 L kg⁻¹ in ruminants). This implies that the pathways of methanogenesis forming part of the microbial digestion of fiber in the foregut are similar between the groups, and that the lower methane emission of camelids can be explained by their generally lower relative food intake. Our results suggest that the methane emission of Australia’s feral camels corresponds only to 1 to 2% of the methane amount produced by the country’s domestic ruminants and that calculations of greenhouse gas budgets of countries with large camelid populations based on equations developed for ruminants are generally overestimating the actual levels.

Central Archive at the University of Reading

Public Library of Science (PLOS)

Repository for Publications and Research Data

Crossref

Directory of Open Access Journals

PubMed Central

ZORA

FigShare

Inorganic and organic trace mineral supplementation in weanling pig diets

Author: ALESSANDRO B. AMORIM
ARTHUR JR
BAKER DH
BATAGLIA OC
BURKETT JL
CASE CL
COFFEY RD
CREECH BL
DELVES HT
ECKMANN L
ETHERIDGE RD
GABRIEL M.P. MELO
KIRCHGESSNER M
LEONARDO A.F. PASCOAL
MAHAN DC
MARIA C. THOMAZ
MARTIN RE
MELLO G
MELLOR D
MUNIZ MHB
MURILO M. ASSIS
PEDRO H. WATANABE
REVY PS
RIZAL A. ROBLES-HUAYNATE
ROSTAGNO HS
SUSANA Z. SILVA
UNDERWOOD EJ
URBANO S. RUIZ
VASSALO M
VELOSO JAF
VEUM TL
VIVIAN V. ALMEIDA
WEDEKIND KJ
Publication venue: 'FapUNIFESP (SciELO)'

Crossref

Invited review: Large-scale indirect measurements for enteric methane emissions in dairy cattle: A review of proxies and their potential for use in management and breeding decisions

Author: Agnew
Agricultural Research Council
Aguinaga Casañas
Alemu
Ann
Antunes-Fernandes
Apache Spark
Arndt
Bannink
Barnett
Baskaran
Beauchemin
Bell
Berry
Biffani
Blaxter
Bouchard
Brask
Calus
Capper
Castro Montoya
Chilliard
Chung
Coffey
Couvreur
D.P. Morgavi
de Haas
De Marchi
Dehareng
Delfosse
Demeyer
Demment
Denman
Dijkstra
Donoghue
E. Negussie
Ellis
F. Biscarini
F. Dehareng
Fievez
Fitzsimons
Freetly
Froidmont
Garnsworthy
Gengler
Gerber
Gill
Goopy
Grainger
Guyader
H. Soyeurt
Haisan
Hammond
Harb
Hayes
Hegarty
Henderson
Herd
Holter
Hristov
IPCC
Iwamoto
J. Dijkstra
John
Jones
Jonker
Kandel
Kirchgessner
Kittelmann
Knapp
Knight
Kubo
Kuzuhara
Lassen
Lassey
Martin
Martin-Collado
McCartney
McDonnell
Methagene
Meuwissen
Miglior
Mills
Mohammed
Moorby
Moraes
Morgavi
Moss
Murray
Muñoz
N. Gengler
Negussie
Newbold
Nielsen
Nkrumah
Okine
Olkin
Pickering
Pinares-Patiño
Popova
Poulsen
R.J. Dewhurst
Ramin
Rico
Roehe
Romero-Perez
Ross
Rutten
S. van Gastelen
Schirmann
Schwarm
Shi
Shinkai
Simm
Soyeurt
Stergiadis
Sun
T. Yan
TensorFlow
van Gastelen
van Lingen
van Middelaar
van Zijderveld
Vanlierde
Vanrobays
Veneman
Vlaeminck
Wall
Wallace
Wang
Warner
Watt
Williams
Wirsenius
Wuchter
Y. de Haas
Yan
Zhao
Ørskov
Publication venue: 'American Dairy Science Association'
Publication date: 01/01/2017

Publication history: Accepted - 7 December 2016; Published online - 1 February 2017. Efforts to reduce the carbon footprint of milk production through selection and management of low-emitting cows require accurate and large-scale measurements of methane (CH4) emissions from individual cows. Several techniques have been developed to measure CH4 in a research setting but most are not suitable for large-scale recording on farm. Several groups have explored proxies (i.e., indicators or indirect traits) for CH4; ideally these should be accurate, inexpensive, and amenable to being recorded individually on a large scale. This review (1) systematically describes the biological basis of current potential CH4 proxies for dairy cattle; (2) assesses the accuracy and predictive power of single proxies and determines the added value of combining proxies; (3) provides a critical evaluation of the relative merit of the main proxies in terms of their simplicity, cost, accuracy, invasiveness, and throughput; and (4) discusses their suitability as selection traits. The proxies range from simple and low-cost measurements such as body weight and high-throughput milk mid-infrared spectroscopy (MIR) to more challenging measures such as rumen morphology, rumen metabolites, or microbiome profiling. Proxies based on rumen samples are generally poor to moderately accurate predictors of CH4, and are costly and difficult to measure routinely on farm. Proxies related to body weight or milk yield and composition, on the other hand, are relatively simple, inexpensive, and high throughput, and are easier to implement in practice. In particular, milk MIR, along with covariates such as lactation stage, is a promising option for prediction of CH4 emission in dairy cows. No single proxy was found to accurately predict CH4, and combinations of 2 or more proxies are likely to be a better solution. Combining proxies can increase the accuracy of predictions by 15 to 35%, mainly because different proxies describe independent sources of variation in CH4 and one proxy can correct for shortcomings in the other(s). The most important applications of CH4 proxies are in dairy cattle management and breeding for lower environmental impact. When breeding for traits of lower environmental impact, single or multiple proxies can be used as indirect criteria for the breeding objective, but care should be taken to avoid unfavorable correlated responses. Finally, although combinations of proxies appear to provide the most accurate estimates of CH4, the greatest limitation today is the lack of robustness in their general applicability. Future efforts should therefore be directed toward developing combinations of proxies that are robust and applicable across diverse production systems and environments. Technical and financial support from the COST Action FA1302 of the European Union.

Wageningen University & Research Publications

Open Repository and Bibliography - Liège

ProdInra

SRUC - Scotland's Rural College

Hal-Diderot

Neuroendocrine control of satiation

Author: Abbott
Agnati
Andino
Andrews
Anini
Antin
Aponte
Asarian
Ashford
Auestad
Azzara
Bagdade
Banks
Baskin
Batterham
Beck
Beglinger
Belgardt
Bendotti
Benoit
Berglund
Berthoud
Bewick
Bezencon
Bi
Blevins
Blouet
Blundell
Boden
Booth
Bouwknecht
Briscoe
Bronstein
Brown
Bruning
Buma
Burdyga
Butler
Callahan
Camerino
Campfield LA
Canabal
Cao
Castonguay
Chavez
Chelikani
Chen
Cheung
Choi
Chu
Chua
Claret
Clark
Clifton
Cohen
Coppari
Corp
Cota
Cotero
Covasa
Cowley
Cox
Davies
Davis
Deblon
Degen
Donohue
Dorsomedial
Dourish
Dryden
Dube
Dunn
Eckel LA
Edfalk
Eerola
Elias
Ellacott
Elmquist
Emond
Enns
Ernst
Fan
Farley
Fioramonti
Flannery
Flynn
Freeman
Fromme
Fu
Fukuwatari
Funakoshi
Gardiner
Geary
Gerspach
Gibbs
Girardet
Gonzalez
Graham
Gray
Grignaschi
Grill
Guy
Guzman
Halford
Hariri
Hayes
Heisler
Hentges
Hill
Hillebrand
Hirasawa
Ho
Horvath
Hulsey
Huszar
Hwang
Ibrahim
Irani
Iskandar
Ito
Jang
Jarvie
Johnson
Kaelin
Kahler
Kang
Kaplan
Kask
Kaye
Kennett
Khan
Kim
Kirchgessner
Kiss
Kissileff
Kitamura
Koda
Konner
Korner
Kotarsky
Kublaoui
Lan
Langhans
Lartigue
Le Foll
Le Sauter
Lee
Leibel
Leibowitz
Leranth
Leung
Levin
Levine
Levitsky
Li
Lieverse
Lin
Liou
Liu
Lo
Loftus
Lokrantz
Lori Asarian
Lotter
Luquet
Lynch
MacDonald
Mace
MacMillan
Makimura
Martin
Matzinger
Mayer
Melville
Mesaros
Meyer
Michaud
Moran
Morton
Moulle
Murphy
Nelson
Nilsson
Niswender
Nonogaki
Obici
Olszewski
Ono
Oomura
Ouaghlidi
Overton
Owen
Padilla
Palou
Parker HE
Penicaud
Peters
Plum
Poeschla
Prete
Reidelberger
Reimann
Ren
Rey
Riedy
Rinaman
Rocca
Rowe
Ruttimann
Sahu
Sanacora
Savastano
Savontaus
Sawchenko
Schwartz
Sclafani
Seeley
Shibasaki
Shu
Shutter
Simons
Sinha
Skibicka
Smith
Sohn
Song
Stadlbauer
Stallone
Stanley
Stearns
Steinert
Stricker
Strubbe
Sykes
Tabarin
Takayanagi
Takiguchi
Tanaka
Taylor
Thomas
Thomas Bächler
Tolle
Tolson
Tordoff
Tsigos
Tung
Uchoa
Vaisse
van
Vanderweele
Varma
Veening
Vickers
Vincent
Walter
Wan
Wang
Wellman
Williams
Woods
Wu
Xu
Yang
Yoshida
Yoshikawa
Yosten
Zhan
Zhang
Zheng
Zhou
Zhu
Publication venue: 'Walter de Gruyter GmbH'

Crossref

Mining and ranking closed itemsets from large-scale transactional datasets

Author: Kirchgessner Martin
Publication date: 26/09/2016

The recent increase of data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of item sets, for example supermarket tickets) containing at least a million transactions over hundreds of thousands of items. These datasets usually follow a "long tail" distribution: a few items are very frequent, and most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (usually the most frequent ones). Thus, existing methods fail to concisely provide relevant insights on large datasets. We therefore introduce a new semantics which is more intuitive for the analyst: browsing associations per item, for any item, and fewer than a hundred associations at once. To address the item coverage challenge, our first contribution is the item-centric mining problem. It consists in computing, for each item in the dataset, the k most frequent closed itemsets containing this item. We present an algorithm to solve it, TopPI. We show that TopPI efficiently computes interesting results over our datasets, outperforming simpler solutions or emulations based on existing algorithms, both in terms of run-time and result completeness. We also show and empirically validate how TopPI can be parallelized, on multi-core machines and on Hadoop clusters, in order to speed up computation on large-scale datasets. Our second contribution is CAPA, a framework allowing us to study which existing measures of association rules' quality are relevant to rank results. This concerns results obtained from TopPI or from jLCM, our implementation of a state-of-the-art frequent closed itemsets mining algorithm (LCM). Our quantitative study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study, in order to discover which of the 5 families we propose highlights the most interesting associations for their domain. Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to show extensive experiments on real, nation-wide supermarket data. We present a complete analytics workflow addressing this use case. We also experiment on Web data. Our contributions can be relevant in various other fields, thanks to the genericity of transactional datasets. Altogether, our contributions allow analysts to discover associations of interest in modern datasets. We pave the way for a more reactive discovery of item associations in large-scale datasets, whether on highly dynamic data or for interactive exploration systems.
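One way to read "similarity of the rankings they produce" is as a rank correlation between the orderings that two measures induce on the same set of rules. The sketch below uses Spearman's coefficient without tie handling; that choice, the scores, and the function names are assumptions for illustration, not the method stated in the abstract.

```python
def ranks(scores):
    """Map each rule to its position (0 = best) when sorted by decreasing score."""
    ordered = sorted(scores, key=lambda rule: -scores[rule])
    return {rule: position for position, rule in enumerate(ordered)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation of two score tables over the same rules.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which assumes
    no tied scores (an assumption of this sketch).
    """
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    d2 = sum((rank_a[rule] - rank_b[rule]) ** 2 for rule in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores given to four rules by three measures.
confidence = {"r1": 0.9, "r2": 0.7, "r3": 0.5, "r4": 0.2}
support = {"r1": 0.30, "r2": 0.25, "r3": 0.10, "r4": 0.05}
lift = {"r1": 1.1, "r2": 1.5, "r3": 2.4, "r4": 3.0}

same_family = spearman(confidence, support)  # identical orderings
other_family = spearman(confidence, lift)    # reversed orderings
print(same_family, other_family)
```

Measures whose pairwise correlations are close to 1 would land in the same group under this reading, while strongly negative values indicate measures that rank the rules in nearly opposite order.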

Theses.fr

Zur Wirkung von Citronensäure im Stoffwechsel [On the effect of citric acid in metabolism]

Author: Bhaduri
Kallen
Kirchgessner
Martin
Schwarz
Singer
Vagelos
Weber
Publication venue: 'Wiley'

Crossref

TopPI: An efficient algorithm for item-centric mining

Author: Amer-Yahia Sihem
Kirchgessner Martin
Leroy Vincent
Termier Alexandre
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017

In this paper, we introduce item-centric mining, a new semantics for mining long-tailed datasets. Our algorithm, TopPI, finds for each item its top-k most frequent closed itemsets. While most mining algorithms focus on the globally most frequent itemsets, TopPI guarantees that each item is represented in the results, regardless of its frequency in the database. TopPI allows users to efficiently explore Web data, answering questions such as "what are the k most common sets of songs downloaded together with the ones of my favorite artist?". When processing retail data consisting of 55 million supermarket receipts, TopPI finds the itemset "milk, puff pastry" that appears 10,315 times, but also "frangipane, puff pastry" and "nori seaweed, wasabi, sushi rice" that occur only 1,120 and 163 times, respectively. Our experiments with analysts from the marketing department of our retail partner demonstrate that item-centric mining discovers valuable itemsets. We also show that TopPI can serve as a building block to approximate complex itemset ranking measures such as the p-value. Thanks to efficient enumeration and pruning strategies, TopPI avoids the search space explosion induced by mining low-support itemsets. We show how TopPI can be parallelized on multi-cores and distributed on Hadoop clusters. Our experiments on datasets with different characteristics show the superiority of TopPI when compared to standard top-k solutions, and to Parallel FP-Growth, its closest competitor.
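The item-centric semantics can be sketched by brute force on a handful of receipts: enumerate the closed itemsets (itemsets equal to the intersection of all transactions that contain them), then keep the k most frequent ones for each item. This exhaustive enumeration is exponential in the number of items and is not TopPI's enumeration and pruning strategy; the receipts and function names are hypothetical.

```python
from itertools import combinations


def closed_itemsets(transactions):
    """Brute-force map from each closed itemset to its support count."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    closed = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cand = frozenset(candidate)
            covering = [t for t in transactions if cand <= t]
            if not covering:
                continue
            # The closure of a candidate is the intersection of every
            # transaction containing it; it has the same support.
            closure = frozenset.intersection(*covering)
            closed[closure] = len(covering)
    return closed


def top_k_per_item(transactions, k):
    """For each item, the k most frequent closed itemsets containing it."""
    closed = closed_itemsets(transactions)
    result = {}
    for item in sorted(set().union(*transactions)):
        containing = [(s, count) for s, count in closed.items() if item in s]
        # Most frequent first; ties broken alphabetically for determinism.
        containing.sort(key=lambda pair: (-pair[1], sorted(pair[0])))
        result[item] = containing[:k]
    return result


# Hypothetical receipts with one popular item and one rare group.
receipts = [
    {"milk", "puff pastry"},
    {"milk", "puff pastry"},
    {"milk", "puff pastry", "frangipane"},
    {"milk"},
    {"nori", "wasabi", "sushi rice"},
]

result = top_k_per_item(receipts, k=2)
for item, itemsets in result.items():
    print(item, [(sorted(s), count) for s, count in itemsets])
```

Even though the sushi ingredients appear on a single receipt, each of them still gets an entry in the output, which is the coverage guarantee the abstract describes; a global top-k by frequency would list only itemsets containing milk.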

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

HAL-Rennes 1